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Abstract 

Matrix completion and approximation are popular tools to capture a user’s preferences for 
recommendation and to approximate missing data. Instead of using low-rank factorization we 
take a drastically different approach, based on the simple insight that an additive model of 
co-clusterings allows one to approximate matrices efficiently. This allows us to build a concise 
model that, per bit of model learned, significantly beats all factorization approaches to matrix 
approximation. Even more surprisingly, we find that summing over small co-clusterings is more 
effective in modeling matrices than classic co-clustering, which uses just one large partitioning 
of the matrix. 

Following Occam’s razor principle suggests that the simple structure induced by our model 
better captures the latent preferences and decision making processes present in the real world 
than classic co-clustering or matrix factorization. We provide an iterative minimization algo¬ 
rithm, a collapsed Gibbs sampler, theoretical guarantees for matrix approximation, and excellent 
empirical evidence for the efficacy of our approach. We achieve state-of-the-art results on the 
Netflix problem with a fraction of the model complexity. 


1 Introduction 

Given users’ ratings of movies or products, how can we model a user’s preferences for different 
types of items and recommend other items that the user will like? This problem, often referred to 
as the Netflix problem, has generated a flurry of research in collaborative filtering, with a variety of 
proposed matrix factorization models and inference methods. Top recommendation systems have 
used thousands of factors per item and per user, as was the case in the winning submissions in 
the Netflix prize [11]. Recent state-of-the-art methods have relied on learning even larger, more 
complex factorization models, often taking nontrivial combinations of multiple submodels [17, 14]. 
Such complex models use large amounts of memory, are increasingly difficult to interpret, and are 
often difficult integrate into larger systems. 
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Figure 1: Accuracy of ACCAMS on Netflix, compared to [11] and [14], Note that our model 
achieves state of the art accuracy at a fraction of the model size. 


1.1 Linear combinations of attributes 

Our approach is drastically different from previous collaborative hltering research. Rather than 
start with the assumptions of a matrix factorization model, we make co-clustering effective for high 
quality matrix completion and approximation. Co-clustering has been well studied [2, 23] but was 
not previously competitive in large behavior modeling and matrix completion problems. To achieve 
state of the art results, we use an additive model of simple co-clusterings that we call stencils, rather 
than building a large single co-clustering. The result is a model that is conceptually simple, has 
a small parameter space, has interpretable structure, and achieves the best published accuracy for 
matrix completion on Netflix, as seen in Figure 1. 

Using a linear combination of co-clusterings corresponds to a rather different interpretation 
of user preferences and movie properties. Matrix factorization assumes that a movie preference 
is based on a weighted sum of preferences for different genres, with the movie properties being 
represented in vectorial form. For instance, if a user likes comedies but not romantic movies, then 
a romantic comedy may have a predicted neutral 3-star rating. 

Co-clustering on the other hand assumes there exists some “correct” partitioning of movies 
(and users). For instance, a user might be part of a group that likes all comedies but does not 
like romantic movies. Correspondingly, all romantic comedies might be aggregated into a cluster, 
possibly partitioned further into PG-13 rated, or R-rated romantic comedies. This quickly leads to 
a combinatorial explosion. 

By taking a linear combination of co-clusterings we beneht from both perspectives: there is 
no single correct partitioning of movies and users; however, we can use the membership in several 
independent groups to encode the factorial nature of attributes without incurring the cost of a 
necessarily high-dimensional model of matrix factorization. For instance, a movie may be {funny, 
sad, thoughtful}, it might have a certain age rating, it might be an {action, romantic, thriller, 
documentary, family} movie, it might be shot in a certain visual style, and by a certain group of 
actors. By taking linear combinations of co-clusterings we can take these attributes into account. 
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1.2 Stencils 


The mathematical challenge that motivated this work is that, in order to encode a rank-Zc matrix 
by a factorization, we need k numbers per row (and column) respectively. With linear combinations 
of stencils, on the other hand, we only need log 2 k bits per row (and column) plus 0{k‘^) floating 
point numbers regardless of the size of the matrix. 

We denote by a stencil a small k x k template of a matrix and its mapping to the row and 
column vectors respectively. This is best understood by the example below: assume that we have 
two simple stencils containing 3x3 and 3x2 co-clusterings. Their linear combination yields a 
rather nontrivial 9x9 matrix of rank 5. 



In contrast, classic co-clustering would require a (3 • 3) x (3 • 2) partitioning to match this structure. 
When we have s stencils of size k x k, this requires a k^ x k^ partitioning. 

By design our model has a parameter space that is an order of magnitude smaller than competing 
methods, requiring only s log 2 k bits per user and per movie and sk'^ floating point numbers, where 
k is generally quite small. This is computationally advantageous of course, but also demonstrates 
that our modeling assumptions better match real world structure of human decision making. 

Finding succinct models for binary matrices, e.g. by minimizing the minimum description 
(MDL), has been the focus of significant research and valuable results in the data mining com¬ 
munity [15, 12]. That said, these models are quite different. To the best of our knowledge, ours is 
the first work aimed at finding a parsimonious model for general (real-valued) matrix completion 
and approximation. 

1.3 Contributions 

Our paper makes a number of contributions to the problem of finding sparse representations of 
matrices. 

• We present ACCAMS, an iterative Zc-means style algorithm that minimizes the approximation 
error by backfitting the residuals of previous approximations. 

• We provide linear approximation rates exploiting the geometry of rows and columns of rating 
matrix via bounds on the metric entropy of Banach spaces. 

• We present a generative Bayesian non-parametric model and devise a collapsed Gibbs sam¬ 
pling algorithm, bACCAMS, for efficient inference. 

• Experiments confirm the efficacy of our approach, offering the best published results for 
matrix completion on Netflix, an interpretable hierarchy of content, and succinct matrix 
approximations for ratings, image, and network data. 
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We believe that these contributions offer a promising new direction for behavior modeling and 
matrix approximation. 

Outline. We begin by discussing related work from recommendation systems, non-parametric 
Bayesian models, co-clustering, and minimum description length. We subsequently introduce the 
simple /c-means style co-clustering and its approximation properties in Section 3. Subsequently, in 
Section 4.1 we define our Bayesian co-clustering model and collapsed Gibbs sampler for a single 
stencil. In Section 4.5 we extend our Bayesian model to multiple stencils. Section 5 reports our 
experimental results and we conclude with a discussion of future directions for the work. 

2 Related Work 

Recommender Systems. Probably the closest to our work is the variety of research on behavior 
modeling and recommendation. Matrix factorization approaches, such as Keren’s SVD-|—I- [10], 
have enjoyed great success in recommender systems. Recent models such as DFC [17] and LLORMA 
[14] have focused on using ensembles of factorizations to exploit local structure. 

More closely related to our model are Bayesian non-parametric approaches. For instance, [7, 6] 
use the Indian Buffet Process (IBP) for recommendation. In doing so they assume that each 
user (and movie) has certain preferential binary attributes. It can be seen as an extreme case of 
ACCAMS where the cluster size k = 2, while using a somewhat different strategy to handle cluster 
assignment and overall similarity within a cluster. Following a similar intuition as ACCAMS but 
different perspective and focus, [19] extended the IBP to handle k > 2 for link prediction tasks 
on binary graphs. Our work differs in its focus on general, real-valued matrices, its application of 
co-clustering, and its significantly simpler parameterization. 

A co-clustering approach to recommendation was proposed by [3] . This model uses co-clustering 
to allow for sharing of strength within each group. However, it does not overcome the rank-A: 
problem, i.e. while clustering reduces intra-cluster variance and improves generalization, it does 
not increase the rank beyond what a simple factorization model is capable of doing. Finally, [21] 
proposed a factorization model based on a Dirichlet process over users and columns. All these 
models are closely related to the mixed-membership stochastic blockmodels of [1]. 

Co-clustering. It was was originally used primarily for understanding the clustering of rows and 
columns of a matrix rather than for matrix approximation or completion [9]. This formulation 
was well suited for biological tasks, but it computationally evolved to cover a wider variety of 
objectives [2]. [20] defined a soft co-clustering objective akin to a factorization model. Recent work 
has defined a Bayesian model for co-clustering focused on matrix modeling [23]. [27] focuses on 
exploiting co-clustering ensembles, but do so by finding a single consensus co-clustering. As far as 
we know, ours is the first work to use an additive combination of co-clusterings. 

Matrix Approximation. There exists a large body of work on matrix approximation in the 
theoretical computer science community. They focus mainly on efficient low-rank approximations, 
e.g. by projection or by interpolation. Examples of the projection based strategy are [8, 4]. Es¬ 
sentially one aims to find a general low-rank approximation of the matrix, as is common in most 
recommender models. 

A more parsimonious strategy is to seek interpolative decompositions. There one aims to approx¬ 
imate columns of a matrix by a linear combination of a subset of other columns [16]. Nonetheless 
this requires us to store at least one, possibly more scaling coefficients per column. Also note the 
focus on column interpolations — this can easily be extended to row and column interpolations, 
simply by first performing a row interpolation and then interpolating the columns. To the best of 
our knowledge, the problem of approximating matrices with piecewise constant block matrices as 
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we propose here is not the focus of research in TCS. 

Succinct modeling. The data mining community has focused on finding succinct models of data, 
often directly optimizing the model size described by the minimum description language (MDL) 
principle [22]. Finding effective ways to compress real world data allows for better modeling and 
understanding of the datasets. This approach has led to valuable results in pattern and item-set 
mining [26, 15] as well as graph summarization [12]. However, these approaches typically focus on 
modeling databases of discrete items rather than real-valued datasets with missing values. 


3 Matrix Approximation 

Before delving into the details of Bayesian Non-parametrics we begin with an optimization view of 
ACCAMS. 


3.1 Model 

Key to our model is the notion of a stencil, an extremely easy to represent block-wise constant 
rank-A: matrix. 


Definition 1 (Stencil) A stencil S{T,c,d) is a matrix S G with the property that Sij = 

Tci,dj for a template T G and discrete index vectors c G {1,..., km}"^ and d G {1,..., kn}"' 

respectively. 


Given a matrix M G it is now our goal to find a stencil S{T, c, d) such that the approximation 

error M — S{T, c, d) is small while simultaneously the cost for storing T, c, d is small. In the context 
of the example above, the 9x9 matrix is given by the sum of two stencils, one of size 3x3 and 
one of size 3x2. This already indicates that we may require more than one stencil for efficient 
approximation. In general, our model will be such as to solve 


minimize 

[ThchS} 


M -'^S{T\c\d}) 
1=1 


2 


Frob 


( 1 ) 


That is, we would like to find an additive model of s stencils that minimizes the approximation 
error to M. Such an expansion affords efficient compression using a trivial codebook, as can be 
seen below. 


Lemma 2 (Compression) The stencil S{T,c,d) can be 
stored at e element-wise accuracy at no more cost than 

0{m log 2 km + n log 2 kn + kmkn log^e-^ \\T\U. 

Proof This follows directly from the construction. Storing the vector c costs at most m log 2 km 
bits if we assume a uniform code (this also holds for d). When storing T approximately, we must 
not quantize at a level of the approximation error or higher. Hence, a simplistic means of encoding 
the dynamic range requires log 2 llTlloo ^lit, which is used on a per-element basis in T. ■ 

Note that considerably better codes may exist whenever the entropy of c is less than log 2 km 
(likewise for d). Nonetheless, even the crude log 2 km bound is already much better than what can 
be accomplished by a low-rank factorization. 

Obviously, given M, it is our goal to find such stencils S{T,c,d) with good approximation 
properties. Unfortunately, finding linear combinations of co-clusterings is as hard as co-clustering: 
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assume that we are given all but one stencil of the optimal solution. In this case our problem 
reduces to co-clustering as its subproblem, which is NP-hard. 

3.2 Algorithm 

We consider a simple iterative procedure in which stencils are computed one at a time, using the 
residuals as input. The inner loop consists of a simple algorithm that is reminiscent of /c-means 
clustering. It proceeds in two stages. Without loss of generality we assume that we have more rows 
than columns, i.e. M G with m > n. 

Row clustering. We first perform fc-means clustering of the rows. That is, we aim to find an 
approximation of M that replaces all rows by a small subset thereof. Note that this is more stringent 
than the interpolative approximations of matrices which only require us to find a set of rows which 
will form a (possibly sparse) basis for all other rows. 


Algorithm 1 RowClustering 
Require: matrix M, row clusters km 

Draw km rows from M at random without replacement and copy them to {ui,..., Vk^}. 
t •«— 0 G and tc ^ 0 G 
while not converged do 

for i = 1 to m do 

Ci argmin^ \\Mi- — VjW^ {Find center} 
tci ^ tci + 1 {Increment cluster count} 

Wci Wci + Mi- {Increment statistics} 

end for 

for i = 1 to a do 

Vi ^ Wijti {New cluster center} 
li ■(— ti and •(— 0 {Cluster counts} 
rcj •<— (0,... 0) G {Reset statistics} 

end for 
end while 

return clusters {ui,... Vk^}, IDs c, counts I 


Algorithm I is essentially fe-means clustering on the rows of M. Once we have this, we now 
cluster the columns of the new matrix in an analogous manner. The only difference is that the 
approximation needs to be particularly good for row clusters that occur frequently. Consequently 
the approximation measure \\Mi- — is replaced by the Mahalanobis distance. That is, the only 
substantial difference to Algorithm 1 is that now we use the assignment 

di ■<— argmin {V-i — Wj)~^ D {Vi — Wj) (2) 

3 

where V G is the matrix obtained by stacking Vi- = Vi and D is the diagonal matrix of 

counts, i.e. Da = k. 

Missing entries. In many cases, however, M itself is incomplete. This can be addressed quite 
easily by replacing the assignment argmin^- ||Mj: — by 

Ci •(— argmin E {Mil - Vjif (3) 
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where we used {i,l) £ M as a shorthand for the existing entries in M. In finding a good cluster for 
the row Mu we restrict ourselves to the coordinates in Vj where Mu exists. 

Likewise, for the purpose of obtaining the column clusters, we now need to keep track for 
each coordinate in V how many elements in M contributed to it. Correspondingly denote by 
'■= Yl(e j)eM-ci,=i ^ ^he number of entries mapped into coordinate Vej- Then the assignment for 
column clusters is obtained via 


dj ^ argmin E {Vij - wij)^ tij 


(4) 


and likewise the averages are now per-coordinate according to the counts for both v and w. 
Backfitting. The outcome of row and column clustering is a stencil S{T,c,d) consisting of the 
clusters obtained by first row and then column clustering and of the assignment vectors c and d 
once the process is complete. It may be desirable to alternate between row and column clustering 
for further refinement. Since each step can only reduce the objective function further and the state 
space of (c, d) is discrete, convergence to a local minimum is assured, with the same caveat on 
solution quality as in fe-means clustering. The last step is to take the residual M — S{T,c,d) and 
use it as the starting point of a new approximation round. 


Algorithm 2 Matrix Approximation 
Require: matrix M, clusters km,kn-, max stencils s 
for iter = 1 to s do 

{V, c, Irow) ^ RowClustering(M, ka) 

(S', d, Zcoi) ColumnClustering(y, 6, diag(/row)) 
for all i,j G {1 ,... km} x {1,... kn} do 
Tij mean {Me/1Ce = i and df = j} 
end for 

M <— — S (T, c, d) and back up (T, c, d) 

end for 


Essentially the last stage is used to ensure that the stencil has minimum approximation error 
given the partitioning. This procedure is repeatedly invoked on the residuals to minimize the loss. 
The result is an additive model of co-clusterings. 

3.3 Approximation Guarantees 

A key question is how well any given matrix M can be approximated by an appropriate stencil. For 
the sake of simplicity we limit ourselves to the case where all entries of the matrix are observed. 
We use covering numbers and spectral properties of M to obtain approximation guarantees. 

Definition 3 (Covering Number) Denote by B a Banach Space. Then for any given set B € B 
the covering number J\fe{B) is given by the set of points { 6 i,... 6 / 4 ( 5 )} such that for any b ^ B 
there exists some bj with ||6 — 6 /|| < e. 

Of particular interest for us are covering numbers A4 of unit balls and their functional inverses e^. 
The latter are referred to as entropy number and they quantify the approximation error incurred by 
using a cover of n elements [24, Chapter 8 ]. A key tool for computing entropy numbers of scaling 
operators is the theorem of Gordon, Konig and Schlitt [5], relating entropy numbers to singular 
values. 
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Theorem 4 (Entropy Numbers and Singular Values) Denote by D a diagonal sealing oper¬ 
ator with D : ip ^ ip with scaling coefficients ai > crj+i > 0 for all i. Then for aW n G N the 
entropy number en{D) is bounded via 

1 

< 6en{D) ( 5 ) 

This means that if we have a matrix with rapidly decaying singular values, we only need to focus 
on the leading largest ones in order to approximate all elements in the space efficiently. Here the 
tradeoff between dimensionality and accuracy is obtained by using the harmonic mean. 

Corollary 5 (Entropy Numbers of Unit Balls) The covering number of a ball B of radius r 
in l\ is bounded by 

rn~d < en{B) < 6rn“3. (6) 

Proof This follows directly from using the linear operator x —)• rx for x £ i^- Here the scaling 

operator has eigenvalues ai = r for all 1 < i < d and a^ = 0 for i > d. The maximum in (5) is 

always j = d. ■ 

The following theorem states that we can approximate M up to a multiplicative constant at any 
step, provided that we pick a large enough clustering. It also means that we get linear convergence, 
i.e. convergence in O(loge) steps to 0(e) error, since the bound can be applied iteratively. 

Theorem 6 (Approximation Guarantees) Denote by ui ,.. .an the singular values of M. Then 
using I elusters for rows and eolumns respeetively the matrix M ean be approximated with error at 
most 

||M-M'||^ <2||M||^q (es) 

||M- M'll^ < (^/m + ^/n) ||M||^ e; 

Here e/ is given by Theorem f and Corollary 5 respectively. 

Proof Using the singular value decomposition of M into M = UT,V we can factorize M = R 
where Q = T ,2 JJ and R = 'E 2 V. By construction, the singular values of Q and R are E 2 . We now 
cluster the rows of Q and R independently to obtain an approximation of M. 

For Q we know that its rows can be approximated by I balls with error ei S’S per Theorem 4. 

Also note that its row vectors are contained in the image of the unit ball under Q — if they were 
not, project them onto the unit ball and the approximation error cannot increase since the targets 
are within the unit ball, too. Hence the e^-cover of the latter also provides an approximation of 
the row-vectors of Q by Q' with accuracy e;, where Q' contains at most I distinct rows. The same 
holds for the matrix R, as approximated by R'. Hence we have 

\Mi, - (Q'„4)| = \{Qi:,Rr.) - {Qi..,R'r.)\ 

= \{Qn-Q'.,,Rr.) + {Q'.,Rr.-R'r.)\ 

< ||Qi: -Q',|| WRj-.W + ||Q',|| \\Rj.. - R'j,\\ 

<2\\Mpei (si) 
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en{D) < 6 sup 
i6N 


n 


■n 

2 = 1 



Figure 2: Generative model for recommdnation and matrix approximation (bACCAMS). For each 
stencil, as indexed by I, row and cluster memberships and are drawn from a Chinese Restaurant 
Process. The values for the template are drawn from a Normal Distribution. The observed 
ratings Mum are sums over the stencils S{T\c\d}). 


This provides a pointwise approximation guarantee.If we only have a bound on the rank and on 
||M||, this yields 

\Mi,-{q„rj)\ < 12||MP^ 

Moreover, since each row in Q and R respectively will be approximated with residual bounded by 
€i we can bound ||(5 — Q^|| < and ||R — i?'|| < y/nei respectively. This yields a bound on the 

matrix norm of the residual via 

\\[QR-Q'R']x\\ < \\Q{R - R')x\\ + \\{Q - Q')R'x\\ 

< {y/m + y/n) ||M||2 ei ||a:|| 

This bounds the matrix norm of the residual. ■ 

Note that the above is an existence proof rather than a constructive prescription. However, by 
using the fact that set cover is a submodular problem [28], it follows that given I, we are able to 
obtain a near-optimal cover, thus leading to a constructive algorithm. Note, however, that the main 
purpose of the above analysis is to obtain theoretical upper bounds on the rate of convergence. In 
practice, the results can be considerably better, as we show in Section 5. 

4 Generative Model 

In the same manner as many risk minimization problems (e.g. penalized logistic regression) have a 
Bayesian counterpart (Gaussian Process classification) [18], we now devise a Bayesian counterpart to 
AGGAMS, which we will refer to as bAGGAMS. We begin with the single stencil case in Section 4.1 
and extend it to many stencils in Section 4.5. 

4.1 Co-Clustering with a Single Stencil 

We begin with a simplistic model of co-clustering. It serves as the basic template for single-matrix 
inference. All subsequent steps use the same idea. In a nutshell, we assume that each user u belongs 
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to a particular cluster Cu drawn from a Chinese Restanrant Process CRP(a). Likewise, we assume 
that each movie m belongs to some cluster dm drawn analogously from CRP(/3). The scores of the 
matrix Mum are obtained from a stencil S{T, c, d)um = with additive noise €um ~ A/’(0, a'^). 

The stencil valnes Ted themselves are drawn from a normal distribntion AA(0,r^). In turn, the 
variances and are obtained via a conjugate prior, i.e. the Inverse Gamma distribntion. 

This is an extremely simple model similar to [23], akin to a decision stump. The rationale for 
picking such a primitive model is that we will be combining linear combinations thereof to obtain 
a very flexible tool. The model is shown in Figure 2. For I = 1 the formal definition is as follows: 


Cu ~ CRP(a) dm ~ CRP(/3) (7a) 

Ted ~ AA( 0 , r2) Mum ~ M {Te^du^ , CT^) (7b) 

~ IG( 7 ) ~ IG(7?) ( 7 c ) 

Recall that the Inverse Gamma distribution is given by 

p{x\a,b) = ( 8 ) 

and hence p(cr^|r/a, %) = e~ (9) 


Conseqnently the joint probability distribntion over all scores, given the variances is given by 

p{M,S{T,c,d)\a,/3,a‘^,T^) (10) 


=CRP(c|Q;)CRP(d|/3) TT exp 


cA 


TTT^ 


rp2 
^ cd 
'2r2 


n 

(u,m) 


V2 


: exp 


vrer^ 


(Mum Te^du 
2a^ 


The idea is that each user and each movie are characterized by a simple clnster. Since we chose all 
priors to be conjngate to the likelihood terms, it is possible to collapse ont the choice of Ted- This 
is particnlarly useful as it allows us to accelerate the sampler considerably — now we only sample 
over the discrete random variables Cu,dm indicating the clnster memberships for a particular user 
and movie. In other words, we obtain a closed form expression for p{M, c, d\a, /?, cr^, r^). Moreover, 
a‘^\S{T,c,d),p and r^|T ,7 are both Inverse Gamma due to conjugacy, hence we can sample them 
efficiently after sampling T. 


4.2 Inferring Clusters 


In the following we discuss a partially collapsed Gibbs sampler (effectively we use a Rao-Blackwellization 
strategy when sampling clnster memberships) for efficient inference. We begin with the part of sam¬ 
pling c\d, a, a^, T^. 

Chinese Restaurant Process: It is well known that for exponential families the conjugate dis¬ 
tribution allows for collapsing by taking ratios between normalization coefficients with and 
withont the additional sufficient statistics item. See e.g. [7, Appendix A] for a detailed 
derivation. Denote by ni,mj the number of users and movies belonging to clusters i and j 
respectively. Moreover, denote by n and m the total nnmber of users and movies, and by 
kn, km the number of clusters. In this case we can express 


and hence p{ci 


p{c\a) = 




r(a) 

r(a -I- n) 


t\c *,a) 


a+n—1 

a 

Q+n—1 


if * > 0 
otherwise 
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An analogous expression is available for p{d\l3) and p{dj = t\d~^, /3). Note that the superscript 
[•]”* denotes that the i-th observation is left out when computing the statistic. Large values 
of a and fd encourage the formation of larger numbers of clusters. The collapsed expressions 
will be useful for Gibbs sampling. 


Integrating out T : For faster mixing we need to integrate out T whenever we resample c and d. 
As we shall see, this is easily accomplished by keeping simple linear statistics of the ratings. 
Moreover, by integrating out T we avoid the problem of having to instantiate a new value 
whenever a new cluster is added. 


For a given block (c, d) with associated T^d, the distribution of ratings is Gaussian with mean 
0 and with covariance matrix (due to the independence of the variances and the 

additive nature of the normal distribution). Here we use 1 to denote the identity matrix and 
1 to denote the vector of all 1. Denote by ricd the number of rating pairs {u,m) for which 
Cu = c and dm = d. Moreover, denote by M^d the vector of associated ratings. Hence, the 
likelihood of the cluster block (c, d), as observed in Med is 


p(Mcd|cr^r^) = 


exp 


+ ^Med 


{ 2 tt ) 2 |cj^l + 2 

In computing the above expression we need to compute the determinant of a diagonal matrix 
with rank-1 update, and the inverse of said matrix. For the former, we use the matrix- 
determinant lemma, and for the latter, the Sherman-Morrison-Woodbury formula; 


M, 


cd 




1-1 


Med = 4 WMcdf - 4 • 

+ nedT 


log 


cr^l + = {Ued - 1) log 0-^ + log [cj^ -F 


This allows us to assess whether it is beneficial to assign a user m or a movie m to a different 
or a new cluster efficiently, since the only statistics involved in the operation are sums of 
residuals and of their squares. 


This leads to a collapsed Gibbs-sampling algorithm. At each step we check how likelihoods change 
by assigning a movie (or user) to another cluster. We denote by n'^ the new cluster count and by 


the new set of residuals. Let 


A := 


'^cd 


1 


[log(27r) -h logo-2] -h ^ \\Med\\ - \\Med\\ 


2 1 J ' 2( j 2 

be a constant offset, in log-space, that only depends on the additional ratings that are added to a 
cluster. In other words, it is independent of the cluster that the additional scores are assigned to. 
Hence A can be safely ignored. 


p{cu = c|-) oc 




a + n — 


iH 


d 


'cr^ -h U-edT^' 

1 

2 



{l^MedfX 

Cj 2 +n'^^T 2 _ 

GXp 

1 - 

to 1 

to| 

_Cj 2 + n'^^T 2 

0-2 -h ncdT'^\ 


For a new cluster this can be simplified since there is no data, hence Ued = 0 and Med = []• 


Pi^C-u — Cnewl') OC 


a-\-n— 


in[ 



1 

(J^ 

2 

exp 



r 2 


E 




( 11 ) 


( 12 ) 


The above expression is fairly straightforward to compute; we only need to track Ued, i-e. the 
number of ratings assigned to a particular (user cluster, movie cluster) combination and Med, 
i.e. the sum of the scores for this combination. 
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4.3 Inferring Variances 

For the purpose of recommendation and for a subsequent combination of several matrices, we need 
to infer variances and instantiate the scores T^d- By checking (10) we see that Tc^|rest is given by 
a normal distribution with parameters 


Tcrfirest ~ Af 


pncd ’ pn J 


where p 


1 + 


ncd 1-2 


(13) 


Note that the term p plays the role of a classic shrinkage term just as in a James-Stein estimator. 
To sample cr^ and we use the Inverse Gamma distribution of (8). 

Denote by E the total number of observed values in M. In this case, is drawn from an 
Inverse Gamma prior with parameters {Pajilh)'- 

and ^ r/;, + ^ {Mum - Tc^druf- (14) 


Analogously, we draw from an Inverse Gamma with parameters 

huA' 


7a 


7a + 


and 7 ^, 


c^d 


2 

cd 


(15) 


kn and km denote the number of user and movie clusters. 


4.4 Efficient Implementation 

With these inference equations we can implement an efficient sampler, as seen in Algorithm 3. The 
key to efficient sampling is to cache the per-cluster sums of ratings Med- Then reassigning a user 
(or movie) to a different (or new) cluster is just a matter of checking the amount of change that 
this would effect. Hence each sampling pass costs 0{kn ■ km - {n + m) + E) operations. It is linear 
in the number of ratings and of partitions. 

Note that once Uud and lud are available for all users (or all movies), it is cheap to perform 
additional sampling sweeps at comparably low cost. It is therefore beneficial to iterate over all 
users (or all movies) more than once, in particular in the initial stages of the algorithm. Also note 
that the algorithm can be used on datasets that are being streamed from disk, provided that an 
index and an inverted index of M can be stored: we need to be able to traverse the data when 
ordered by users and when ordered by movies. It is thus compatible with solid state disks. 

4.5 Additive Combinations of Stencils 

If there was no penalty on the number of clusters it would be possible to approximate any matrix 
by a trivial model using as many clusters as we have rows and columns. 

Lemma 7 Any matrix M G fiQg nonvanishing support in (10) regardless of . 

Proof Since any partitionings of sets of size m, n respectively have nonzero support, it follows 
that partitioning all rows and all columns into separate bins is possible. Hence, we can assign a 
different mean ped to any entry Mum- ® 

Obviously, the GRP prior on c and d makes this highly unlikely. On the other hand, we want to 
retain the ability to fit a richer set of matrices than what can be effectively covered by piecewise 
constant block matrices. We take linear combinations of matrices, as introduced in Section 3. 
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Algorithm 3 StencilSampler 

Initialize row-index and column-index of data in M 
Initialize sum of squares Q := Yl(um) ^um 
Initialize statistics for each partition 

ricd ■= {{u, m) : Cu = c, dm = d} and led ■= ^ Mum 

{u,m):Cu=c,dm=d 


while sampler not converged do 

for all users u do 

For all movie clusters d compute the incremental changes 

Uud ■= {{u,m) : dm = d} and lud ■= ^ Mum 

(u^m):dm=d 


Remove u from their cluster 

Remove u from their cluster ^ Icud — n-ud 

Sample new user cluster c„ using (11) and (12). 

Update statistics 

^Cud ^ ^Cud flud and Ic-ad ^ ^Cud flud 


end for 

for all movies m do 

Sample movie cluster assignments analogously. 

end for 

for all (c, d) cluster partitions do 

Resample Ted using (13) and the statistics Uedjed- 

end for 

Resample and via the Inverse Gamma distribution using (14) and (15). 

end while 


As before, we enumerate the stencils by S{T^,c\d^). Correspondingly we now need to sample 
from a set of S{T^,c\d^) and per matrix. However, we keep the additive noise term Af{0,a‘^) 
unchanged. This is the model of Figure 2. The additivity of Gaussians renders makes inference 
easy: 

Mum ~ ^ • ( 16 ) 

Note, though, that estimating S jointly for all indices I is not tractable since various clusterings 
{c\d^) overlap and intersect with each other, hence the joint normal distribution over all variables 
would be expensive to factorize. 

Instead, we sample over one stencil at a time, as shown in Algorithm 4. This algorithm only 
requires repeated passes through the dataset. Moreover, it can be modified into a backfitting 
procedure by fitting one matrix at a time and then fixing the outcome. Capacity control can be 
enforced by modifying a and /3 such that the probability of a new cluster decreases for larger I, i.e. 
by decreasing a and /3. As a result following the analysis in the single stencil case, each sampling 


13 





Algorithm 4 bACCAMS 

initialize residuals p := M and = 0 
while sampler not converged do 

for all stencils I do 

Compute partial residuals p <— p — S{T\ c\d^) 
Sample over S (T^ iC^S) using p instead of M 
Update residuals with p ■<— p + S{T\c\ d}) 

end for 
end while 


pass costs 0{s ■ {kn ■ km • (n + m) + E)) operations where s is the number of stencils. It is linear in 
the number of ratings, in the number of partitions and in the number of stencils. 


5 Experiments 

We evaluate our method based on its ability to perform matrix completion, matrix approximation 
and to give interpretable results. Here we describe our experimental setup and results on real world 
data, such as the Netflix ratings. 

5.1 Implementation 

We implemented both ACCAMS, the A:-means-based algorithm, as well as bACCAMS, the Bayesian 
model. Unless specified otherwise, we run the RowClustering of Algorithm 1 for up to T = 50 
iterations. Our system can also iterate over the stencils multiple times, such that earlier stencils 
can be re-learned after we have learned later ones. In practice, we observe this only yields small 
gains in accuracy, hence we generally do not use it. 

We implemented bACCAMS using Gibbs sampling (Section 4.5) and used the fc-means algorithm 
ACCAMS for the initialization of each stencil. Following standard practice, we bound the range of 
a by Umax from above. This rejection sampler avoids pathological cases. For the sake of simplicity, 
we set k = kn = km to be the maximum number of clusters that can be generated in each stencil. 
When inferring the cluster assignments for a given stencil, we run three iterations of the sampler 
before proceeding to the next stencil. As common in MCMC algorithms, we use a burn-in period 
of at least 30 iterations (each with three sub-iterations of sampling cluster assignments) and then 
average the predictions over many draws. Code for both ACCAMS and bACCAMS is available at 
http://alexbeutel.com/accams. 

5.2 Experimental Setup 

Netflix. We run our algorithms on data from a variety of domains. Our primary testing dataset 
is the ratings dataset from the Netflix contest. The dataset contains lOOM ratings from 480k users 
and 17k movies. Following standard practice for testing recommendation accuracy, we average over 
three different random 90:10 splits for training and testing. 

CMU Face Images. To test how well we can approximate arbitrary matrices, we use image data 
from the CMU Face Images dataset^. It contains black and white images of 20 different people, 

Rttp://goo.gl/FsoX5p 


14 








each in 32 different positions, for a total of 640 images. Each image has 128 x 120 pixel resolution; 
we flatten this into a matrix of 640 x 15360, i.e. an image by pixel matrix. 

AS Peering Graph. To assess our model’s ability to deal with graph data we consider the AS 
graph^. It contains information on the peering information of 13,580 nodes. It thus creates a binary 
matrix of size 13, 580 x 13, 580 with 37k edges. Since our algorithm is not designed to learn binary 
matrices, we treat the entries {0,1} as real valued numbers. 

Parameters. For all experiments, we set the hyperparameters in bACCAMS to a = /3 = 10, 
r]a = 2, rjp = 0.3, ja = 5, and 7 ^ = 0.3. Depending on the task, we compare ACCAMS against 
SVD++ using the GraphChi [13] implementation, SVD from Matlab for full matrices, and previ¬ 
ously reported state-of-the-art results. 

Model complexity. Since our model is structurally quite different from factorization models, we 
compare them based on the number of bits in the model and prediction accuracy. For factorization 
models, we consider each factor to be a 32 bit float. Hence the complexity of a rank R SVD-|—|- 
model of n users and m movies is 32 • R(n + m) bits. 

For ACCAMS with s stencils and k x k co-clusters in each stencil, the cluster assignment for a 
given row or column is log 2 k bits and each value in the stencil is a float. As such, the complexity 
of a model is s((n -|- m) log 2 k + 2)2 ■ k"^) bits. 

In calculating the parameter space size for LLORMA, we make the very conservative estimate 
that each row and column is on average part of two factorizations, even though the model contains 
more than 30 factorizations that each row and column could be part of. 

5.3 Matrix Completion 

Since the primary motivation of our model is collaborative filtering we begin by discussing results 
on the classic Netflix problem; accuracy is measured in RMSE. To avoid divergence we set fTmax = 1- 
We then vary both the number of clusters k and the number of stencils s. 

A summary of recent results as well as results using our method can be found in Table I. Using 
CraphChi we run SVD-|—|- on our data. We use the reported values from LLORMA [14] and DFC 
[17], which were obtained using the same protocol as reported here. 

As can be seen in Table 1, bACCAMS achieves the best published result. We achieve this 
while using a very different model that is significantly simpler both conceptually and in terms of 
parameter space size. We also did not use any of the temporal and contextual variants that many 
other models use to incorporate prior knowledge. 

As shown in Figure 1, we observe that per bit our model achieves much better accuracy at a 
fraction of the model size. In Figure 3(a) we compare different configurations of our algorithm. As 
can be seen, classic co-clustering quickly overfits the training data and provides a less fine-grained 
ability to improve prediction accuracy than ACCAMS. Since ACCAMS has no regularization, it 
too overfits the training data. By using a Bayesian model with bACCAMS, we do not overfit 
the training data and thus can use more stencils for prediction, greatly improving the prediction 
accuracy. 

5.4 Matrix Approximation 

In addition to matrix completion, it is valuable to be able to approximate matrices well, especially 
for dimensionality reduction tasks. To test the ability of ACCAMS to model matrix data we analyze 
both how well our model fits the training data from the Netflix tests above as well as on image data 

^http://topology.eecs.umich.edu/data.html 


15 




Method 

Parameters 

Size 

Test RMSE 

SVD++ [13] 

R = 25 

49.8MB 

0.8631 

DFC-NYS [17] 

Not reported 

0.8486 

DFC-PROJ [17] 

Not reported 

0.8411 

LLORMA [14] 

R = 1 

3.98MB 

0.9295 

LLORMA [14] 

R = 5, a > 30 

19.9MB 

0.8604 

LLORMA [14] 

R = 10, a > 30 

39.8MB 

0.8444 

LLORMA [14] 

R = 20, a > 30 

79.7MB 

0.8337 

ACCAMS 

k = W, s = 13 

2.69MB 

0.8780 

ACCAMS 

k = 100, s = 5 

2.27MB 

0.8759 

bACCAMS 

A: = 10, s = 50 

10.4MB 

0.8403 

bACCAMS 

A: = 10, s = 70 

14.5MB 

0.8363 

bACCAMS 

A = 10, s = 125 

25.9MB 

0.8331 


Table 1: bACCAMS achieves an accuracy for matrix completion on Netflix better than or on-par 
with the best published results, while having a parameter space a fraction of the size of other 
methods, a denotes the number of anchor points for LLORMA and sizes listed are the parameter 
space size. 


from the CMU Faces dataset and a binary matrix from the AS peering graph. (Note, for Netflix we 
now use the training data from one split of the dataset.) For each of these of datasets we compare 
to the SVD (or SVD-|—|- to handle missing values). We also use our algorithm to perform classic 
co-clustering by setting s = 1 and varying k. 

As can be seen in Figure 3(b-d), ACCAMS models the matrices from all three domains much 
more compactly than SVD (or SVD-|--|- in the case of the Netflix matrix, which contains missing 
values). In particular, we observe on the CMU Faces matrix that ACCAMS uses in some cases 
under | of the bits as SVD for the same quality matrix approximation. Additionally, we observe 
that using a linear combination of stencils is more efficient to approximate the matrices than 
performing classic co-clustering where we have just one stencil. Ultimately, although the method 
was not designed specifically for image or network data, we observe that our method is effective for 
succinctly modeling the data. 

5.5 Interpretability 

In any model the structure of factors makes assumptions about the form of user preferences and 
decision making. The fact that our model achieves high-quality matrix completion with a smaller 
parameter space suggests that our modeling assumptions better match how people make decisions. 
One side effect of our model being both compact and conceptually simple is that we can understand 
our learned parameters. 

To test the model’s interpretability we use ACCAMS to model the Netflix data with s = 20 
stencils and = 100 clusters (a model of similar size to a rank-3 matrix factorization). Here we 
look at two ways to interpret the results. 

First we view the cluster assignments in stencils as inducing a hierarchy on the movies. That 
is, movies are split in the first level based on their cluster assignments in the first stencil. At the 
second level, we split movies based on their cluster assignments in the second stencil, etc. In Figure 
4 we observe the hierarchy of TV shows induced by the first three stencils learned by ACCAMS 
(we only include shows where there is more than one season of that show in the leaf and we pruned 
small partitions due to space restrictions). 
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Figure 3: On images, ratings, and binary graphs, ACCAMS approximates the matrix more 
efficiently than SVD, SVD++, or classic co-clustering. 


As can be seen in the hierarchy, there are branches which clearly cluster together shows more 
focused on male audiences, female audiences, or children. However, beyond a first brush at the 
leaf nodes, we can notice some larger structural differences. For example, looking at the two large 
branches coming from the root, we observe that the left branch generally contains more recent TV 
shows from the late 1990s to the present, while the right branch generally contains older shows 
ranging from the 1960s to the mid 1990s. This can be most starkly noticed by “Friends,” which 
shows up in both branches; Seasons 1 to 4 of “Friends” from 1994-1997 fall in the older branch, 
while Seasons 5 to 9 from 1998-2002 fall in the newer branch. Of course the algorithm does not 
know the dates the shows were released, but our model learns these general concepts just based 
on the ratings. From this it is clear the stencils can be useful for breaking down content in a 
meaningful structured way, something that is not possible under classic factorization approaches. 

While the hierarchy demonstrates that our stencils are learning meaningful latent factors, it 
may be difficult to always understand individual clusters. Rather, to use knowledge from all of the 
stencils, we can look to the use case of “Users who watched X also liked V,” and ask given a movie 
or TV show to search, can we find other similar items? We do this by comparing the set of cluster 
assignments from the given movie to the set of cluster assignments of other items. We measure 
similarity between two movies using the Hamming distance between cluster assignments. 

As can be seen in Table 5, we find the combination of clusters for different movies and TV 
shows can be used to easily find similar content. While we see some obvious cases where the 
method succeeds, e.g. “Sex and the City” returns six more seasons of “Sex and the City,” we 
also notice the method takes into account more subtle similarities of movies beyond genre. For 
example, while the first season of “Seinfeld” returns the subsequent seasons of “Seinfeld,” it is 
followed by three seasons of “Curb Your Enthusiasm,” another comedy show by the same writer 
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Figure 4: Hierarchy of TV Shows on Netflix based on the first three stencils generated by ACCAMS. 


2001: A Space Odyssey- 
Taxi Driver 
Chinatown 
Citizen Kane 
Dr. Strangelove 
A Clockwork Orange 
THX 1138: Special Edition 
Apocalypse Now Redux 
The Graduate 
Blade Runner 
The Deer Hunter 
Deliverance 


Sex and the City: Season 1 
Sex and the City: Season 2 
Sex and the City: Season 3 
Sex and the City: Season 4 
Sex and the City: Season 5 
Sex and the City: Season 6.1 
Sex and the City: Season 6.2 
Hercules: Season 3 
Will &: Grace: Season 1 
Beverly Hills 90210: Pilot 
The O.C.: Season 1 
Divine Madness 


Seinfeld: Seasons 1 &: 2 
Seinfeld: Season 3 
Seinfeld: Season 4 
Curb Your Enthusiasm: Season 1 
Curb Your Enthusiasm: Season 2 
Curb Your Enthusiasm: Season 3 
Arrested Development: Season 1 
Newsradio: Seasons 1 and 2 
The Kids in the Hall: Season 1 
The Simpsons: Treehouse of Horror 
Spin City: Michael J. Fox 
Curb Your Enthusiasm: Season 4 


Mean Girls 

Clueless 

13 Going on 30 

Best in Show 

Particles of Truth 

Gharlie’s Angels: Full Throttle 

Amelie 

Me Myself I 

Bring it On 

Chaos 

Kissing Jessica Stein 

Nine Innings from Ground Zero 


Star ^Va^s: Episode V 

The Silence of the Lambs 

Scooby-Doo Where Are You? 

La-w Order: Season 1 

Star Wars: Episode IV 

Star Wars: Episode VI 

Battlestar Galactica: Season 1 

Raiders of the Lost Ark 

Star Wars: Clone Wars: Vol. 1 
Gladiator: Extended Edition 

Star Wars Trilogy: Bonus Material 
LOTR: The Fellowship of the Ring 
LOTR: The Two Towers 

Indiana Jones Trilogy: Bonus Material 
LOTR: The Return of the King 

The Sixth Sense 

Alien: Collector’s Edition 

The Exorcist 

Schindler’s List 

The Godfather 

Seven 

Colors 

The Godfather, Part II 
GoodFellas: Special Edition 
Platoon 

Full Metal Jacket 

The Flintstones: Season 2 

Classic Cartoon Favorites: Goofy 
Transformers: Season 1 (1984) 

Tom and Jerry: Whiskers Away! 

Boy Meets World: Season 1 

The Flintstones: Season 3 
Scooby-Doo’s Greatest Mysteries 

Care Bears: Kingdom of Caring 

Aloha Scooby-Doo! 

Scooby-Doo: Legend of the Vampire 
Rugrats: Decade in Diapers 

Law Sz Order: Season 3 

Law &c Order: SVU (2) 

Law &: Order: Criminal Intent (3) 
Law Sc Order: Season 2 

MASH: Season 8 

ER: Season 1 

MASH: Season 7 

Rikki-Tikki-Tavi 

The X-Files: Season 6 

The X-Files: Season 7 

ER: Season 3 


Figure 5: For a given movie or TV show on Netflix, we can use the cluster assignments to find 
related content. 


Larry David. Similarly, searching for Stanley Kubrick’s “2001: A Space Odyssey” returns other 
Stanley Kubrick movies, as well as other critically acclaimed films from that era, particularly 
thematically similar science fiction movies. Searching for “Scooby-Doo” returns topically similar 
children’s shows, specifically from the mid to late 1900’s. From this we get a sense that ACCAMS 
does not just find similarity in genre but also more subtle similarities. 

5.6 Properties of ACCAMS 

Aside from ACCAMS’s success across matrix completion and approximation, it is valuable to 
understand how our method is working, particularly because of how different it is from previous 
models. First, because ACCAMS uses backfitting, we expect that the first stencil captures the 
largest features, the second captures secondary ones, etc. This idea is backed up by our theoretical 
results in Section 3.3, and we observe that this is working experimentally by the drop off in RMSE 
for our matrix approximation results in Figure 3. We can visually observe this in the image 
approximation of the CMU Faces. As can be seen in Figure 6, the first stencil captures general 
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Figure 6: Examples of original images and the first two stencils. The decomposition is very similar 
to that of eigenfaces [25], albeit much more concise in its nature. 



Figure 7: Left: Entropy in cluster assignments for users and movies. Right: Stability of the 
assignments in the sampler. 


structures of the room and heads, and the second starts to fill in more fine grained details of the 
face. 

The Bayesian model, bACCAMS, backfits in the first iteration of the sampler but ultimately 
resamples each stencil many times thus loosening these properties. In Figure 7, we observe how 
the distribution of users and movies across clusters changes over iterations and number of stencils, 
based on our run of bACCAMS with s = 70 stencils and a maximum of A: = 10 clusters per stencil. 
As we see in the plot of entropy, movies, across all 70 stencils, are well distributed across the 10 
possible clusters. Users, however, are well distributed in the early stencils but then are only spread 
across a few clusters in later stencils. In addition, we notice that while the earlier clusters are 
stable, later stencils are much less stable with a high percentage of cluster assignments changing. 
Both of these properties follow from the fact that most users rate very few movies. For most users 
only a few clusters are necessary to capture their observed preferences. Movies, however, typically 
have more ratings and more latent information to infer. Thus through all 70 stencils we learn useful 
clusterings, and our prediction accuracy improves through s = 125 stencils. 
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6 Discussion 


Here we formulated a model of additive co-clustering. We presented both a fc-means style algorithm, 
ACCAMS, as well as a generative Bayesian non-parametric model with a collapsed Gibbs sampler, 
bACCAMS; we obtained theoretical guarantees for matrix approximation through additive co¬ 
clustering; and we showed that our method is concise and accurate on a diverse range of datasets, 
including achieving the best published accuracy on Netflix. 

Given the novelty and initial success of the method, we believe that domain-specihc variants of 
ACCAMS, such as for community detection and topic modeling, can and will lead to new models 
and improved results. In addition, given the modularity of our framework, it is easy to incorporate 
side information, such as explicit genre and actor data, in modeling rating data that should lead 
to improved accuracy and interpretability. 
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